Briefings in Bioinformatics — Latest Matching Preprints

1

Dr. Sim: Similarity Learning for Transcriptional Phenotypic Drug discovery

Wei, Z.; Zhu, S.; Chen, X.; Zhu, C.; Duan, B.; Liu, Q.

2021-09-24 bioinformatics 10.1101/2021.09.23.461458 medRxiv

Top 0.1%

40.5%

Transcriptional phenotypic drug discovery has achieved great success, and various compound perturbation-based data resources, such as Connectivity Map (CMap) and Library of Integrated Network-Based Cellular Signatures (LINCS), have been presented. Computational strategies that fully mine these resources for phenotypic drug discovery have been proposed, and a fundamental issue among them is how to define a proper similarity between transcriptional profiles to elucidate drug mechanisms of action and identify new drug indications. Traditionally, this similarity has been defined in an unsupervised way, and because of the high dimensionality and high noise of these high-throughput data, it lacks robustness and offers limited performance. In our study, we present Dr. Sim, a general learning-based framework that automatically infers a similarity measurement rather than relying on a manually designed one, and that can be used to characterize transcriptional phenotypic profiles for drug discovery with good generalized performance. We evaluated Dr. Sim on comprehensive publicly available in vitro and in vivo datasets for drug annotation and repositioning using high-throughput transcriptional perturbation data and showed that Dr. Sim significantly outperforms existing methods. By learning transcriptional similarity, it represents a conceptual improvement that facilitates the broad utility of high-throughput transcriptional perturbation data for phenotypic drug discovery. The source code and usage instructions for Dr. Sim are available at https://github.com/bm2-lab/DrSim/.

2

MCNET: Multi-Omics Integration for Gene Regulatory Network Inference from scRNA-seq

Tiwari, A.; Trankatwar, S.

2023-06-05 genetic and genomic medicine 10.1101/2023.05.29.23290691 medRxiv

Top 0.1%

39.6%

Deep learning has emerged as a powerful approach in various domains, including biological network analysis. This paper investigates the advancements in computational techniques for inferring gene regulatory networks (GRNs) and introduces MCNET, a state-of-the-art deep learning algorithm. MCNET integrates multi-omics data to infer GRNs and extract biologically significant representations from single-cell RNA sequencing (scRNA-seq) data. By incorporating attention mechanisms and graph convolutional networks, MCNET captures intricate regulatory relationships among genes. Extensive benchmarking on diverse scRNA-seq datasets demonstrates MCNET's superiority over existing methods in GRN inference, scRNA-seq data visualization, clustering, and simulation. Notably, MCNET accurately predicts gene regulations on cell-type marker genes in the mouse cortex, validated by epigenetic data. The introduction of MCNET paves the way for advanced analysis of scRNA-seq data and provides a powerful tool for inferring GRNs in a multi-omics context. Moreover, this paper addresses the integration of multi-omics data in gene regulatory network inference, proposing MCNET as a method that efficiently analyzes and visualizes homogeneous gene regulatory networks derived from diverse omics data. The inference capability of MCNET is evaluated through extensive experiments with simulation data and applied to analyze the biological network of psychiatric disorders using human brain data.

3

VIRALpre: Genomic Foundation Model Embedding Fused with K-mer Feature for Virus Identification

Wang, Z.; Yu, Q.; Li, Y.

2024-11-15 bioinformatics 10.1101/2024.11.12.623150 medRxiv

Top 0.1%

33.6%

Viruses, submicroscopic infectious agents, influence all life forms. Identifying viral sequences is essential for understanding their biological functions and analyzing their impact on public health and the development of microbial communities. Given this significance, tools have been developed based on various mathematical methods and algorithms. However, previous methods struggle to identify viral sequences accurately, especially short contigs, because of the limited information they carry and the small-scale, closed-set datasets used. Here we propose VIRALpre, a hybrid framework that combines genomic foundation model (GFM) embeddings with K-mer features of sequences to precisely recognize viral genomic fragments. VIRALpre is empowered by the generalization capabilities of GFMs, which have proven their strength in various downstream tasks thanks to newly established large-scale training databases and the attention mechanism. In addition, K-mer features provide additional biological information to compensate for the limitations of GFMs in classification tasks. Comprehensive experimental results demonstrate that VIRALpre significantly outperforms all previous methods on virus identification by 4% in accuracy. To test whether the model holds up on contigs dissimilar to the training data, a BLASTn-based similarity cut-off test (e-value set to 10^-5) was performed, in which it achieves about a 10% F1-score improvement. Beyond the curated test datasets, new zero-shot cross-dataset tests were conducted on benchmark datasets sampled from natural environments, where VIRALpre identifies most viral sequences while keeping a very low false positive rate. Based on these experiments, VIRALpre is able to handle short-contig virus identification by learning the distinctions of viral sequences and may serve as an aid to virus-related research.
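
To make the K-mer feature mentioned in this abstract concrete, the following is a minimal sketch of a normalized K-mer frequency vector for a DNA sequence. The function name `kmer_frequencies`, the default `k=3`, and the choice to skip windows containing ambiguous bases are illustrative assumptions; the abstract does not specify how VIRALpre computes or fuses its K-mer features.

```python
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return normalized k-mer counts, ordered lexicographically over ACGT.

    Hypothetical helper for illustration; not taken from VIRALpre.
    """
    alphabet = "ACGT"
    # Map every possible k-mer to a fixed position in the feature vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # windows containing N or other symbols are skipped
            counts[j] += 1
    total = sum(counts)
    # An all-zero vector is returned when no valid window exists.
    return [c / total for c in counts] if total else [0.0] * len(counts)


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGTNNACG", k=2)
    print(len(features))  # 4**2 = 16 features
```

A vector of this kind has a fixed length of 4^k regardless of contig length, which is one reason K-mer profiles are a common companion to learned sequence embeddings.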

4

DeepUMQA3: a web server for model quality assessment of protein complexes

Liu, J.; Liu, D.; Zhang, G.

2023-04-28 bioinformatics 10.1101/2023.04.24.538194 medRxiv

Top 0.1%

33.3%

Model quality assessment is a crucial part of protein structure prediction and a gateway to proper usage of models in biomedical applications. Many methods have been proposed for assessing the quality of structural models of protein monomers, but few exist for evaluating protein complex models. As protein complex structure prediction becomes a new challenge, model quality assessment methods that can provide accurate evaluation of complex structures are urgently required. Here, we present DeepUMQA3, a web server for evaluating protein complex structures using a deep neural network. For an input complex structure, features are extracted at three levels (overall complex, intra-monomer, and inter-monomer), and an improved deep residual neural network is used to predict per-residue lDDT and interface residue accuracy. DeepUMQA3 ranks first in the blind test of interface residue accuracy estimation in CASP15, with Pearson, Spearman and AUC of 0.564, 0.535 and 0.755 under the lDDT measurement, which are 18.5%, 23.6% and 10.9% higher than the second-best method, respectively. DeepUMQA3 can also accurately assess the accuracy of all residues in the entire complex and distinguish high- and low-precision residues/models. The DeepUMQA3 web server is freely available at http://zhanglab-bioinf.com/DeepUMQA_server/.

5

An extensive evaluation of single-cell RNA-Seq contrastive learning generative networks for intrinsic cell-types distribution estimation

Alsaggaf, I.; Buchan, D.; Wan, C.

2025-09-17 bioinformatics 10.1101/2025.09.15.675691 medRxiv

Top 0.1%

33.1%

Contrastive learning has already been widely used to handle single-cell RNA-Seq data due to its outstanding performance in transforming original data distributions into hypersphere feature spaces. In this work, we conduct a large-scale empirical evaluation to investigate the generative encoder networks that are learned by five different state-of-the-art single-cell RNA-Seq contrastive learning methods. Unlike the conventional discriminative model-based cell-type prediction studies, this work is focused on the performance of contrastive learning-based generative encoder networks in terms of their capacity to estimate the intrinsic distributions of different cell-types - a fundamental property that directly affects the performance of any downstream single-cell RNA-Seq data analytics. The experimental results confirm that supervised contrastive learning-based encoder networks lead to better performance than self-supervised contrastive learning-based encoder networks, and the recently proposed Gaussian noise augmentation-based single-cell RNA-Seq contrastive learning method shows the best performance on estimating the intrinsic distribution of different cell-types.

6

Deep Learning Enhanced Tandem Repeat Variation Identification via Multi-Modal Conversion of Nanopore Reads Alignment

Liao, X.; Zhou, J.; Zhang, B.; Li, X.; Xu, X.; Li, H.; Gao, X.

2023-08-19 bioinformatics 10.1101/2023.08.17.553659 medRxiv

Top 0.1%

32.6%

Identification of tandem repeat (TR) variations plays a crucial role in advancing our understanding of genetic diseases, forensic analysis, evolutionary studies, and crop improvement, thereby contributing to various fields of research and practical applications. However, traditional TR identification methods are often limited to processing genomes obtained through sequence assembly and cannot start detection directly from sequencing reads. Furthermore, the inflexibility of detection modes and parameters hinders the accuracy and completeness of the identification, rendering the results unsatisfactory. As a result, existing TR variation identification methods are associated with high computational cost and limited detection sensitivity, precision and comprehensiveness. Here, we propose DeepTRs, a novel method for identifying TR variations, which enables direct TR variation identification from raw Nanopore sequencing reads and achieves high sensitivity, accuracy, and completeness through the multi-modal conversion of Nanopore read alignments and deep learning. Comprehensive evaluations demonstrate that DeepTRs outperforms existing methods.

7

Considering Zeros in Single Cell Sequencing Data Correlation Analysis

Cai, G.; Yu, X.; Xiao, F.

2023-05-14 bioinformatics 10.1101/2023.05.13.540566 medRxiv

Top 0.1%

32.5%

Single-cell sequencing technology has enabled correlation analysis of genomic features at the cellular level. However, high levels of noise and sparsity in single-cell sequencing data make accurate assessment of correlations challenging. This study provides a toolkit, SCSC (https://github.com/thecailab/SCSC), for the estimation of correlation coefficients in single-cell sequencing data. It comprehensively assessed four strategies (classical, non-zero, dropout-weighted, imputation) and the impact of data features in various simulated scenarios. The study found that filtering zeros significantly improves estimation accuracy, and further improvement can be achieved by considering the drop-out probability. In addition, the study also identified data features including expression level, library size, and biological variations that affect correlation estimation.
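
As a rough illustration of two of the four strategies named in this abstract, the sketch below contrasts a classical Pearson correlation with a non-zero variant. It assumes that "non-zero" means keeping only cells in which both features are non-zero; that reading, and the function names `pearson` and `nonzero_pearson`, are assumptions for illustration and are not taken from the SCSC toolkit.

```python
import math


def pearson(x, y):
    """Classical Pearson correlation over all cells."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either feature is constant
    return sxy / math.sqrt(sxx * syy)


def nonzero_pearson(x, y):
    """Pearson correlation restricted to cells where both values are non-zero.

    Assumed interpretation of the "non-zero" strategy, for illustration only.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a != 0 and b != 0]
    if len(pairs) < 2:
        return float("nan")  # too few cells left to estimate a correlation
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


if __name__ == "__main__":
    gene_a = [0, 1, 2, 3, 0]
    gene_b = [5, 2, 4, 6, 0]
    print(pearson(gene_a, gene_b), nonzero_pearson(gene_a, gene_b))
```

In this toy input the zeros, which could be dropout events, pull the classical estimate well below the estimate computed on the three cells where both values were observed.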

8

DTI-CDF: a CDF model towards the prediction of DTIs based on hybrid features

Chu, Y.; Zhang, Y.; Wang, W.; Shan, X.; Wang, X.; Xiong, Y.; Wei, D.

2019-06-03 bioinformatics 10.1101/657973 medRxiv

Top 0.1%

32.3%

Drug-target interactions (DTIs) play a crucial role in target-based drug discovery and exploitation. Computational prediction of DTIs has become a popular alternative to experimental methods for identifying DTIs, which are both time- and resource-consuming. However, current DTI prediction approaches suffer from low precision and a high false positive rate. In this study, we aimed to develop a novel DTI prediction method, named DTI-CDF, that improves prediction precision using a cascade deep forest model that integrates hybrid features, including multiple similarity-based features extracted from the heterogeneous graph, fingerprints of drugs, and evolutionary information of target protein sequences. In the experiments, we ran five replicates of 10-fold cross-validation under three different experimental settings, in which the DTI values of certain drugs (SD), targets (ST), or drug-target pairs (SP) are missing from the training set but present in the test set. The experimental results show that our proposed approach DTI-CDF achieved significantly higher performance than the state-of-the-art methods.

9

Enhancing Vaxign-DL for Vaccine Candidate Prediction with added ESM-Generated Features

Chen, Y.; Zhang, Y.; He, Y.

2024-09-08 bioinformatics 10.1101/2024.09.04.611295 medRxiv

Top 0.1%

29.2%

Many vaccine design programs have been developed, including our own machine learning approaches Vaxign-ML and Vaxign-DL. Using deep learning techniques, Vaxign-DL predicts bacterial protective antigens by calculating 509 biological and biomedical features from protein sequences. In this study, we first used the protein folding ESM program to calculate a set of 1,280 features from individual protein sequences, and then utilized the new set of features separately or in combination with the traditional set of 509 features to predict protective antigens. Our results showed that the ESM-derived features alone were able to accurately predict vaccine antigens with a performance similar to the original Vaxign-DL prediction method, and that combining the ESM-derived and original Vaxign-DL features significantly improved the prediction performance according to a set of seven scores including specificity, sensitivity, and AUROC. To further evaluate the updated methods, we conducted a Leave-One-Pathogen-Out Validation (LOPOV) study, and found that the use of ESM-derived features significantly improved the prediction of vaccine antigens from 10 bacterial pathogens. This research is the first reported study demonstrating the added value of protein folding features for vaccine antigen prediction.
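
The Leave-One-Pathogen-Out Validation described in this abstract follows the general leave-one-group-out pattern: every protein from one pathogen is held out for testing while the model is trained on the rest. The sketch below shows only that splitting logic; the function name `leave_one_pathogen_out` and the toy labels are hypothetical, and the abstract does not describe how the authors implemented their splits.

```python
def leave_one_pathogen_out(pathogens):
    """Yield (held_out, train_idx, test_idx) for each distinct pathogen label.

    `pathogens` holds one label per protein; indices refer to positions in it.
    Hypothetical helper illustrating the splitting scheme only.
    """
    for held_out in sorted(set(pathogens)):
        test_idx = [i for i, p in enumerate(pathogens) if p == held_out]
        train_idx = [i for i, p in enumerate(pathogens) if p != held_out]
        yield held_out, train_idx, test_idx


if __name__ == "__main__":
    labels = ["E. coli", "S. aureus", "E. coli", "M. tuberculosis"]
    for held_out, train_idx, test_idx in leave_one_pathogen_out(labels):
        print(held_out, train_idx, test_idx)
```

Because no protein from the held-out pathogen appears in training, this scheme estimates how well a predictor transfers to an unseen organism, which a random split cannot show.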

10

USPNet: unbiased organism-agnostic signal peptide predictor with deep protein language model

Chen, S.; Tan, Q.; Li, J.; Li, Y.

2021-11-05 bioinformatics 10.1101/2021.11.04.467361 medRxiv

Top 0.1%

29.2%

A signal peptide is a short peptide located at the N-terminus of proteins. It plays an important role in targeting and transferring transmembrane proteins and secreted proteins to their correct positions. Compared with traditional experimental methods for identifying and discovering signal peptides, computational methods are faster and more efficient, which makes them more practical for analyzing thousands or even millions of protein sequences, especially for metagenomic data. Computational tools have therefore recently been proposed to classify signal peptides and predict cleavage site positions, but most of them disregard the extreme data imbalance problem in these tasks. In addition, almost all these methods rely on additional group information of proteins to boost their performance, which, however, may not always be available. To deal with these issues, in this paper, we present Unbiased Organism-agnostic Signal Peptide Network (USPNet), a signal peptide prediction and cleavage site prediction model based on a deep protein language model. We propose to use label distribution-aware margin (LDAM) loss and evolutionary scale modeling (ESM) embedding to handle data imbalance and object-dependence problems. Extensive experimental results demonstrate that the proposed method significantly outperforms all previous methods in classification performance. An additional study on simulated metagenomic data further indicates that our model is a more universal and robust tool that does not depend on additional group information of proteins, with the Matthews correlation coefficient improved by up to 17.5%. The proposed method will be potentially useful for discovering new signal peptides from abundant metagenomic data.
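
For readers unfamiliar with the LDAM loss named in this abstract, the sketch below follows its general published formulation: each class j receives a margin proportional to n_j^(-1/4), where n_j is the class's training count, and that margin is subtracted from the true-class logit before a standard cross-entropy. The rescaling to `max_margin=0.5`, the function names, and the omission of any logit scaling factor are assumptions for illustration; USPNet's actual settings are not given in the abstract.

```python
import math


def ldam_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n_j ** -0.25.

    Margins are rescaled so the rarest class receives `max_margin`
    (an assumed hyperparameter, not a value taken from USPNet).
    """
    raw = [n ** -0.25 for n in class_counts]
    scale = max_margin / max(raw)
    return [r * scale for r in raw]


def ldam_loss(logits, target, margins):
    """Cross-entropy after subtracting the target class's margin from its logit."""
    adjusted = list(logits)
    adjusted[target] -= margins[target]
    # Numerically stable log-sum-exp over the adjusted logits.
    top = max(adjusted)
    log_z = top + math.log(sum(math.exp(v - top) for v in adjusted))
    return log_z - adjusted[target]


if __name__ == "__main__":
    margins = ldam_margins([10000, 16])  # frequent class, rare class
    print(margins)
    print(ldam_loss([2.0, 1.0], target=1, margins=margins))
```

Rare classes receive larger margins, so a rare-class example must be classified with more confidence before its loss becomes small, which counteracts the bias toward frequent classes.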

11

A Structure-based B-cell Epitope Prediction Model Through Combing Local and Global Features

Lu, S.; Li, Y.; Nan, X.; Zhang, S.

2021-07-14 bioinformatics 10.1101/2021.07.13.452188 medRxiv

Top 0.1%

28.9%

B-cell epitopes (BCEs) are a set of specific sites on the surface of an antigen that bind to an antibody produced by a B-cell. The recognition of BCEs is a major challenge for drug design and vaccine development. Compared with experimental methods, computational approaches have strong potential for BCE prediction at much lower cost. However, most current methods focus on using local information around the target residue without taking the global information of the whole antigen sequence into consideration. We propose a novel deep learning method that combines local features and global features for BCE prediction. In our model, two parallel modules are built to extract local and global features from the antigen separately. For local features, we use Graph Convolutional Networks (GCNs) to capture information about the spatial neighbors of a target residue. For global features, Attention-Based Bidirectional Long Short-Term Memory (Att-BLSTM) networks are applied to extract information from the whole antigen sequence. The local and global features are then combined to predict BCEs. The experiments show that the proposed method achieves superior performance over the state-of-the-art BCE prediction methods on benchmark datasets. We also compare performance with and without global features. The experimental results show that global features play an important role in BCE prediction. Our detailed case study on BCE prediction for the SARS-CoV-2 receptor binding domain confirms that our method is effective for predicting and clustering true BCEs.

12

Accurate nucleic acid-binding residue identification based on domain-adaptive protein language model and explainable geometric deep learning

Zeng, W.; Pan, L.; Ji, B.; Xu, L.; Peng, S.

2024-12-16 bioinformatics 10.1101/2024.12.11.628078 medRxiv

Top 0.1%

28.9%

Protein-nucleic acid interactions play a fundamental and critical role in a wide range of life activities. Accurate identification of nucleic acid-binding residues helps to understand the intrinsic mechanisms of the interactions. However, the accuracy and interpretability of existing computational methods for recognizing nucleic acid-binding residues need to be further improved. Here, we propose a novel method called GeSite, based on a domain-adaptive protein language model and an explainable E(3)-equivariant graph convolution neural network. Prediction results across multiple benchmark test sets demonstrate that GeSite is superior or comparable to state-of-the-art prediction methods. The performance comparison on low structure similarity and newly released test proteins demonstrates the robustness and generalization of the method. Detailed experimental results suggest that the advanced performance of GeSite lies in the well-designed nucleic acid-binding protein adaptive language model. Meanwhile, interpretability analysis exposes the perception of the prediction model on various remote and close functional domains, which is the source of its discernment. The data and source code of GeSite are freely accessible at https://github.com/pengsl-lab/GeSite.

13

Systematic Evaluation of Cell Type Deconvolution Methods for Plasma Cell-free DNA

Sun, T.; Yuan, J.; Zhu, Y.; Yang, S.; Zhou, J.; Ge, X.; Qu, S.; Li, W.; Li, J. J.; Li, Y.

2024-03-29 bioinformatics 10.1101/2024.03.25.586507 medRxiv

Top 0.1%

28.7%

Plasma cell-free DNA (cfDNA) is derived from cellular death in various tissues. Investigating the origin of cfDNA through tissue/cell type deconvolution allows us to detect changes in tissue homeostasis that occur during disease progression or in response to treatment. Consequently, cfDNA has emerged as a valuable noninvasive biomarker for disease detection and treatment monitoring. Although there are numerous methylation-based methods of cfDNA cell type deconvolution available, a comprehensive and systematic evaluation of these methods has yet to be conducted. In this study, we thoroughly benchmarked five previously published methods: MethAtlas, cfNOMe, CelFiE, CelFEER, and UXM. Utilizing deep whole-genome bisulfite sequencing data from 35 human cell types, we generated cfDNA mixtures with known ground truth to assess the deconvolution performance under various scenarios. Our findings indicate that different factors, including sequencing depth, reference marker selection, and reference completeness, influence cell type deconvolution performance. Notably, omitting cell types present in a mixture from the reference leads to suboptimal results. Although each method performed differently under various scenarios, CelFEER and UXM exhibit overall superior performance compared to the others. In summary, we comprehensively evaluated factors influencing methylation-based cfDNA cell type deconvolution and proposed general guidelines to maximize the performance.

14

Unlocking hidden flaws in PPV and ACC: A step towards more reliable identification of protein complex

Huang, Y.; Wang, J.; Gong, X.

2025-03-11 bioinformatics 10.1101/2025.03.03.641161 medRxiv

Top 0.1%

28.6%

As classic evaluation indexes, clustering-wise positive predictive value (PPV) and accuracy (ACC) have been widely used for the detection of protein complexes ([1]). However, we identified a critical error in their calculation, which can lead to inaccurate evaluation results under most conditions. Here, we elaborate on the problem of the original indexes and propose revised indexes PPVM (PPV Modified) and ACCM, which correct the identified error. Experiments demonstrate that the revised indexes achieve higher reliability. Based on the new indexes, we reevaluated three state-of-the-art computational methods for protein complex detection on five benchmarks to provide a revised baseline that facilitates performance comparison for algorithms developed later. The code and data involved in the experimental section of this paper can be found at https://github.com/hyx-1/PPV_M-and-ACC_M.

15

Research on protein structure prediction and folding based on novel remote homologs recognition

Zhao, K.; Xia, Y.; Zhang, F.; Zhou, X.; Li, S. Z.; Zhang, G.

2022-10-20 bioinformatics 10.1101/2022.10.16.512404 medRxiv

Top 0.1%

28.2%

Recognition of remote homologous structures is a necessary module in AlphaFold2 and is also essential for the exploration of protein folding pathways. Here, we developed a new method, PAthreader, which identifies remote homologous structures based on the three-track alignment of distance profiles and structure profiles originated from PDB and AlphaFold DB by deep learning. Based on the identified templates, we further enhanced state-of-the-art modelling method and explored protein folding pathways based on the residue frequency distribution of homologs and the secondary structure. The results show that the average accuracy of templates identified by PAthreader is 11.6% higher than those of HHsearch on 551 nonredundant proteins. In terms of structure modelling, PAthreader improves the performance of AlphaFold2 and ranks first in CAMEO blind test for the last three months. Furthermore, we explored protein folding pathways for 37 proteins. The results are almost consistent with biological experiments for 7 proteins, and the remaining 30 human proteins have yet to be verified by biological experiments, revealing that folding information can be exploited from remote homologous structures.

16

Sequence-aware Prediction of Point Mutation-induced Effects on Protein-Protein Binding Affinity using Deep Learning

Zhuang, J.; Li, Z.; Wang, S.; Zheng, R.; Zhang, G.

2025-11-16 bioinformatics 10.1101/2025.11.15.688659 medRxiv

Top 0.1%

27.8%

Amino acid mutations may lead to significant changes in the binding affinity of protein complexes, thereby causing a series of cellular dysfunctions. Therefore, accurate prediction of protein-protein binding affinity changes (ΔΔG) induced by amino acid mutations is of great importance for understanding protein-protein interactions (PPIs). In this study, we propose SAMAffinity, a protein sequence-aware deep learning architecture for predicting changes in protein-protein binding affinity caused by amino acid mutations. SAMAffinity predicts mutation-induced ΔΔG by integrating multi-source sequence features, leveraging a Mutation-Site Identification (MSI) module to highlight local semantic shifts and a Binding-Interface Awareness (BIA) module to capture interaction changes. Benchmark evaluations on public datasets show that under the mutation-level data splitting strategy, SAMAffinity outperforms the state-of-the-art sequence-based method AttABseq by 33.3%, 72.3%, 31.8%, and 30.5% on S1131, S4169, S645, and M1101 datasets, respectively. Moreover, under the complex-level data splitting strategy, SAMAffinity surpasses the structure-based method MpbPPI by 22.9%, 22.7%, 5.0%, and 11.4% on the corresponding datasets. Beyond predictive accuracy, the strong consistency between the model's predicted distribution and natural amino-acid mutation tendencies indicates that SAMAffinity effectively captures the underlying mutational landscape shaped by intrinsic biochemical and evolutionary factors. Based on this capability, SAMAffinity demonstrated strong generalization in a study of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) cases, suggesting its potential for optimizing therapeutic antibody design.

17

CAMP: a Convolutional Attention-based Neural Network for Multifaceted Peptide-protein Interaction Prediction

Lei, Y.; Li, S.; Liu, Z.; Wan, F.; Tian, T.; Li, S.; Zhao, D.; Zeng, J.

2020-11-16 bioinformatics 10.1101/2020.11.16.384784 medRxiv

Top 0.1%

27.6%

Peptide-protein interactions (PepPIs) are involved in various fundamental cellular functions and their identification is crucial for designing efficacious peptide therapeutics. To facilitate the peptide drug discovery process, a number of computational methods have been developed to predict peptide-protein interactions. However, most of the existing prediction approaches depend heavily on high-resolution structure data. Although several deep-learning-based frameworks have been proposed to predict compound-protein interactions or protein-protein interactions, few of them are designed specifically to predict peptide-protein interactions. In this paper, we present a sequence-based Convolutional Attention-based neural network for Multifaceted prediction of Peptide-protein interactions, called CAMP, which predicts both binary peptide-protein interactions and the corresponding binding residues in the peptides. We also construct a benchmark dataset containing high-quality peptide-protein interaction pairs with the corresponding peptide binding residues for model training and evaluation. CAMP incorporates convolutional neural network architectures and an attention mechanism to fully exploit informative sequence-based features, including secondary structures, physicochemical properties, intrinsic disorder features and the position-specific scoring matrix of the protein. Systematic evaluation on our benchmark dataset demonstrates that CAMP outperforms the state-of-the-art baseline methods on binary peptide-protein interaction prediction. In addition, CAMP can successfully identify the binding residues involved in non-covalent interactions for peptides. These results indicate that CAMP can serve as a useful tool in peptide-protein interaction prediction and peptide binding site identification, which can thus greatly facilitate the peptide drug discovery process. The source code of CAMP can be found at https://github.com/twopin/CAMP.

18

MsPBRsP: Multi-scale Protein Binding Residues Prediction Using Language Model

Li, Y.; Lu, S.; Nan, X.; Zhang, S.; Zhou, Q.

2023-02-27 bioinformatics 10.1101/2023.02.26.528265 medRxiv

Top 0.1%

27.5%

Accurate prediction of protein binding residues (PBRs) from sequence is important for understanding cellular activity and helpful for the design of novel drugs. However, experimental methods are time-consuming and expensive. In recent years, many computational predictors based on machine learning and deep learning models have been proposed to reduce this cost. However, those methods often use MSA tools such as PSI-BLAST or NetSurfP to generate statistical features and feed them into predictive models as necessary supplementary input. The input generation process normally takes a long time, and there is no standard specifying which and how many statistical results should be provided to a prediction model. In addition, prediction of PBRs relies on the local context of residues, but the most appropriate scale is undetermined. Most works pre-select certain residue features as input and a scale size based on expertise for a certain type of PBR. In this study, we propose a general tool-free end-to-end framework that can be applied to all types of PBRs, Multi-scale Protein Binding Residues Prediction using language model (MsPBRsP). We adopt the pre-trained language model ProtTrans to avoid the large cost of MSA tools, and use the protein sequence alone as input to our model. To ease scale size uncertainty, we construct multi-size windows in the attention layer and multi-size kernels in the convolutional layer. We test our framework on various benchmark datasets including PBRs from protein-protein, protein-nucleotide, protein-small ligand, heterodimer, homodimer and antibody-antigen interactions. Compared with existing state-of-the-art methods, MsPBRsP achieves superior performance with less running time and higher prediction rates on every PBR prediction task. Specifically, we boost F1 score by 27.1% and AUPRC score by 7.6% on the NSP448 dataset and decrease running time from over 10 minutes to under 0.1 s on average.
The source code and datasets are available at https://github.com/biolushuai/MsPBRsP-for-multiple-PBRs-prediction.

19

AImmune: a new blood-based machine learning approach to improving immune profiling analysis on COVID-19 patients

Zhang, X. T.; Han, R. H.

2021-12-01 genetic and genomic medicine 10.1101/2021.11.26.21266883 medRxiv

Top 0.1%

27.4%

A massive number of transcriptomic profiles of blood samples from COVID-19 patients have been produced since the COVID-19 pandemic began; however, these big data from primary studies have not been well integrated by machine learning approaches. Taking advantage of modern machine learning algorithms, we collected and integrated single cell RNA-seq (scRNA-seq) data from three independent studies, identified genes potentially useful for interpreting severity, and developed a high-performance deep learning-based deconvolution model, AImmune, that can predict the proportions of seven different immune cell types from bulk RNA-seq results of human peripheral blood mononuclear cells (PBMCs). This novel approach can be used for clinical blood testing of COVID-19, on the grounds that previous research shows that mRNA alterations in blood-derived PBMCs may serve as a severity indicator. Assessed on real-world data sets, the AImmune model outperformed the most widely recognized immune profiling model, CIBERSORTx. The study showed that results obtained via true scRNA-seq can be consistently reproduced with AImmune, indicating a potential replacement for the costly scRNA-seq technique in the analysis of circulating blood cells for both clinical and research purposes.

20

RNALens: Study on 5' UTR Modeling and Cell-Specificity

Mao, L.; Tian, Y.; Qian, K.-w.; Song, Y.

2025-07-20 bioinformatics 10.1101/2025.07.20.665722 medRxiv

Top 0.1%

26.8%

Recently, the Transformer architecture has been applied to predict the structure, function, and regulatory activity of biological sequences. Predicting the cell-specific regulatory impact of 5' untranslated regions (5' UTRs) on mRNA expression and translation remains a key challenge for rational mRNA design. Existing studies such as UTR-LM, RNABERT, and RNA-FM train transformer-based models solely on 5' UTR sequences with fixed nucleotide tokenization schemes and auxiliary structural features. These models pay less attention to the integration of broader genomic context and thermodynamic objectives, which limits their ability to generalize across diverse cell types and accurately predict both mRNA expression level (EL) and translation efficiency (TE). In this paper, we propose RNALens, a foundation model pre-trained in two stages on multispecies genomic sequences and curated 5' UTR data using masked language modeling augmented with secondary structure prediction and minimum free energy regression. RNALens employs byte-pair encoding to capture variable-length nucleotide motifs. It is then fine-tuned on high-throughput reporter assay datasets from HEK293T, PC3, and muscle tissues to yield specialized predictors for EL and TE in each cellular context. Experimental results on benchmark datasets demonstrate that RNALens achieves performance superior to existing machine learning methods for both expression and translation predictions across cell-specific and cross-context tests, offering an efficient in silico platform for guiding the design of mRNA therapeutics with precise cellular targeting.